information discrepancy for distillation. γ controls the proportion of selected discrepant proposal pairs and is further validated in Section 6.5.4.
For each iteration, we first solve the inner-level optimization, i.e., the selection of proposals, by exhaustive sorting [249]; we then solve the upper-level optimization, i.e., distilling the selected pairs, using the entropy distillation loss discussed in Section 6.5.3. Since only a small number of proposals are involved, the inner-level optimization is relatively efficient.
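As a concrete illustration, the following minimal sketch (PyTorch) shows the inner-level step under the assumption that each student/teacher proposal pair has already been assigned a scalar information-discrepancy score; the function name and the toy scores are ours, not part of the original implementation.

```python
import torch

def select_discrepant_pairs(disc_scores: torch.Tensor, gamma: float) -> torch.Tensor:
    """Inner-level step: exhaustively sort proposal pairs by their information
    discrepancy and keep the top-gamma fraction (returns indices of kept pairs)."""
    k = max(1, int(round(gamma * disc_scores.numel())))
    order = torch.argsort(disc_scores, descending=True)  # exhaustive sorting
    return order[:k]

# Toy usage: 8 candidate student/teacher proposal pairs, keep the 60% most discrepant.
scores = torch.tensor([0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6])
print(select_discrepant_pairs(scores, gamma=0.6))  # tensor([0, 5, 3, 7, 2])
```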
6.5.3 Entropy Distillation Loss
After selecting a specific number of proposals, we crop the features based on the proposals we obtained. Most SOTA detection models are based on Feature Pyramid Networks (FPN) [143], which significantly improve the robustness of multi-scale detection. For the Faster-RCNN framework in this chapter, we resize the proposals and crop the corresponding features from each stage of the neck feature maps. For the SSD framework, we generate the proposals from the regression layer and crop the features from the feature map with the largest spatial size.
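Below is a minimal sketch of this cropping step using torchvision's roi_align; the helper name, the 7×7 output size, and the toy shapes are illustrative assumptions rather than the exact implementation.

```python
import torch
from torchvision.ops import roi_align

def crop_proposal_features(neck_feats, proposals, image_size, out_size=7):
    """Crop (resize) proposal regions from each neck/FPN feature map.

    neck_feats: list of (B, C, H_i, W_i) feature maps from the neck.
    proposals:  list of per-image (N_j, 4) boxes in image coordinates (x1, y1, x2, y2).
    image_size: (H_img, W_img), used to derive each level's spatial scale.
    """
    crops = []
    for feat in neck_feats:
        scale = feat.shape[-1] / image_size[-1]  # feature stride relative to the image
        crops.append(roi_align(feat, proposals, output_size=out_size,
                               spatial_scale=scale, aligned=True))
    return crops  # one (sum_j N_j, C, out_size, out_size) tensor per pyramid level

# Toy usage: two FPN levels, one image, two proposals.
feats = [torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32)]
boxes = [torch.tensor([[10., 10., 100., 120.], [50., 40., 200., 180.]])]
print([c.shape for c in crop_proposal_features(feats, boxes, image_size=(512, 512))])
```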
Then we formulate the entropy distillation process as follows:
\[
\max_{R^s_n} \; p(R^s_n \mid R^t_n). \tag{6.87}
\]
Eq. 6.87 is the upper level of the bi-level optimization, where m has already been solved in the inner level and is therefore omitted. We rewrite Eq. 6.87 to obtain our entropy distillation loss as
\[
\mathcal{L}_P(\mathbf{w}, \alpha; \gamma) = (R^s_n - R^t_n) + \mathrm{Cov}(R^s_n, R^t_n)^{-1}(R^s_n - R^t_n)^2 + \log\big(\mathrm{Cov}(R^s_n, R^t_n)\big), \tag{6.88}
\]
where $\mathrm{Cov}(R^s_n, R^t_n) = \mathbb{E}(R^s_n R^t_n) - \mathbb{E}(R^s_n)\mathbb{E}(R^t_n)$ denotes the covariance matrix.
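As an illustration only, the sketch below implements one possible reading of Eq. 6.88 for a single proposal pair, assuming the cropped features are flattened to vectors and Cov(·,·) is estimated as a scalar cross-covariance over feature elements; the helper name and the small ε used to keep the inverse and logarithm well defined are our assumptions, not the authors' implementation.

```python
import torch

def entropy_distill_loss(rs: torch.Tensor, rt: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """One reading of Eq. 6.88 for a single student/teacher proposal pair.

    rs, rt: flattened student / teacher features cropped from the same proposal.
    Cov(rs, rt) = E[rs*rt] - E[rs]E[rt] is estimated here as a scalar
    cross-covariance over feature elements (an assumption of this sketch).
    """
    cov = ((rs * rt).mean() - rs.mean() * rt.mean()).abs().clamp_min(eps)
    diff = (rs - rt).mean()
    return diff + diff.pow(2) / cov + torch.log(cov)

# Toy usage on one flattened 256x7x7 cropped proposal pair.
rs, rt = torch.randn(256 * 7 * 7), torch.randn(256 * 7 * 7)
print(entropy_distill_loss(rs, rt))
```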
Hence, we train the 1-bit student model end-to-end; the total loss for distilling the student model is defined as
\[
\mathcal{L} = \mathcal{L}_{GT}(\mathbf{w}, \alpha) + \lambda \mathcal{L}_P(\mathbf{w}, \alpha; \gamma) + \mu \mathcal{L}_R(\mathbf{w}, \alpha), \tag{6.89}
\]
where $\mathcal{L}_{GT}$ is the detection loss derived from the ground-truth labels, and $\mathcal{L}_R$ is defined in Eq. 6.80.
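For completeness, a one-line sketch of how the three terms in Eq. 6.89 are combined, using the hyper-parameter values selected in Section 6.5.4; the individual loss terms are placeholders computed elsewhere.

```python
def total_loss(l_gt, l_p, l_r, lam=0.4, mu=1e-4):
    """Eq. 6.89: detection loss plus weighted entropy distillation and binarization losses."""
    return l_gt + lam * l_p + mu * l_r
```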
6.5.4 Ablation Study
Selecting the hyper-parameters. As mentioned above, we select the hyper-parameters λ, γ, and μ in this part. First, we select μ, which controls the binarization process. As plotted in Fig. 6.17 (a), we fine-tune μ in four situations: raw BiRes18, and BiRes18 distilled by Hint [33], FGFI [235], and our IDa-Det, respectively. In general, performance first increases and then decreases as μ increases. On raw BiRes18 and IDa-Det BiRes18, the 1-bit student performs best when μ is set to 1e-4, whereas 1e-3 works better for the Hint- and FGFI-distilled 1-bit students. Therefore, we set μ to 1e-4 for the extended ablation study. Figure 6.17 (b) shows that performance first increases and then decreases as λ grows from left to right. In general, IDa-Det performs better with λ set to 0.4 or 0.6. Varying γ, we find that {λ, γ} = {0.4, 0.6} boosts the performance of IDa-Det the most, achieving 76.9% mAP on VOC test2007. Based on the ablation study above, we set the hyper-parameters λ, γ, and μ to 0.4, 0.6, and 1e-4 for the experiments in this chapter.
Effectiveness of components. We first compare our information discrepancy-aware (IDa) proposal selection method with other proposal selection methods: Hint [33] (using the neck feature without a region mask) and FGFI [235]. We show the effectiveness of IDa on the two-stage Faster-RCNN in Table 6.5. In Faster-RCNN, the introduction of IDa improves mAP